Conversation
56cc73b to 3372321
… Arrow on JDK9+

### What changes were proposed in this pull request?

This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow. The Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](apache/arrow#5078)).

> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.

### Why are the changes needed?

After ARROW-3191, the Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After apache#26133, the JDK11 Jenkins jobs seem to fail.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/

```scala
Previous exception in task: sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
	io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
	io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
	io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
	io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
	org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with JDK11.

Closes apache#26552 from dongjoon-hyun/SPARK-ARROW-JDK11.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
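For context, a minimal sketch of what the documented requirement amounts to at the JVM level; the object name below is hypothetical, and the PR itself only touches the JDK11 test configuration rather than application code:

```scala
// Illustrative only: the flag added to the JDK11 test configuration is a plain JVM
// system property, so the equivalent of -Dio.netty.tryReflectionSetAccessible=true
// can also be set programmatically, as long as it happens before Netty/Arrow classes load.
object EnableNettyReflectiveAccess {
  def main(args: Array[String]): Unit = {
    System.setProperty("io.netty.tryReflectionSetAccessible", "true")
    // ... start Spark / Arrow work after this point
  }
}
```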
viirya left a comment:
So basically, how nested column predicates are pushed down is implemented by each v2 data source. The main changes here look like:
- Add the v2 Filter and its subclasses.
- Translate Catalyst predicates to v2 Filters (see the sketch after this list).
- Replace the v1 Filter with the v2 Filter in the Orc filter helper.
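As a rough illustration of the translation step (the helper name and the exact pattern below are assumptions for this sketch, not necessarily the PR's code):

```scala
// Hypothetical sketch of translating a Catalyst predicate into a v2 Filter.
// EqualTo/FilterV2 are the classes added in this PR (package as shown in the diff);
// translateFilterV2 is an assumed helper name.
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, Literal, EqualTo => CatalystEqualTo}
import org.apache.spark.sql.connector.expressions.FieldReference
import org.apache.spark.sql.sources.v2.{EqualTo, FilterV2}

def translateFilterV2(predicate: Expression): Option[FilterV2] = predicate match {
  case CatalystEqualTo(a: Attribute, Literal(value, _)) =>
    // nested columns become multi-part references, e.g. FieldReference(Seq("a", "b"))
    Some(EqualTo(FieldReference(Seq(a.name)), value))
  case _ =>
    None // predicates that cannot be translated are simply not pushed down
}
```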
import org.apache.spark.sql.connector.expressions.NamedReference;

@Experimental
public abstract class FilterV2 {
The package is already v2. Do we need to add v2?
 */
public abstract NamedReference[] references();

protected NamedReference[] findReferences(Object valve)
/**
 * Methods that can be shared when upgrading the built-in Hive.
 */
trait OrcFiltersBase {
/**
 * The base class file format that is based on text file.
 */
abstract class TextBasedFileFormat extends FileFormatV2 {
I don't seem to see where TextBasedFileFormat and FileFormatV2 are used?
 * @throws IllegalArgumentException If the delete is rejected due to required effort
 */
void deleteWhere(Filter[] filters);
void deleteWhere(FilterV2[] filters);
I think that a good way to switch between v1 filters and v2 filters is to add both methods and convert from v2 to v1 in a default implementation of the v2 version. That's an easy way for people to update to the new filter API.
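A minimal sketch of that suggestion, rendered in Scala for brevity; the toV1 helper is hypothetical and only converts one filter type for illustration:

```scala
// Keep both overloads, and give the v2 overload a default implementation that
// converts to v1 filters and delegates, so existing sources need no changes.
import org.apache.spark.sql.sources
import org.apache.spark.sql.sources.v2.{EqualTo, FilterV2}

trait SupportsDelete {
  // existing v1 method that sources already implement
  def deleteWhere(filters: Array[sources.Filter]): Unit

  // new v2 overload with a default that falls back to the v1 method
  def deleteWhere(filters: Array[FilterV2]): Unit = deleteWhere(filters.map(toV1))

  // hypothetical conversion helper, shown for a single filter type only
  private def toV1(filter: FilterV2): sources.Filter = filter match {
    case EqualTo(ref, value) => sources.EqualTo(ref.fieldNames.mkString("."), value)
    case other => throw new IllegalArgumentException(s"Cannot convert to a v1 filter: $other")
  }
}
```

Sources that want the extra expressiveness of v2 filters can still override the v2 overload directly.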
import org.apache.spark.sql.connector.expressions.NamedReference;

@Experimental
public abstract class FilterV2 {
Do we want to use Filter or should we use Predicate for expressions that evaluate to a boolean?
 * @since 3.0.0
 */
@Experimental
case class EqualTo(ref: NamedReference, value: Any) extends FilterV2 {
Can FilterV2 extend the v2 Expression base?
 */
@Experimental
case class EqualTo(ref: NamedReference, value: Any) extends FilterV2 {
  override def references: Array[NamedReference] = Array(ref) ++ findReferences(value)
Why is value Any? Shouldn't it be an expression (like the v2 Literal)?
Also, in Iceberg expressions we've updated ref to be a Term instead of a Reference. Both Reference and Transform are terms, which allows us to express that the value of a transformed reference is equal to something. That gives us the ability to express date(ts) = '2020-01-17', for example.
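To illustrate the Term idea, here is a purely hypothetical sketch in the style described above; none of these names are part of this PR:

```scala
// Both plain references and transformed references are terms, so a predicate can
// constrain the result of a transform as well as a raw column.
object TermExample {
  sealed trait Term
  final case class Reference(fieldNames: Seq[String]) extends Term
  final case class ApplyTransform(name: String, term: Term) extends Term

  final case class TermEqualTo(term: Term, value: Any)

  // expresses date(ts) = '2020-01-17'
  val example: TermEqualTo =
    TermEqualTo(ApplyTransform("date", Reference(Seq("ts"))), "2020-01-17")
}
```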
 * limitations under the License.
 */

package org.apache.spark.sql.sources.v2
I think this should be in the connector.expressions package.
/**
 * Used to read and write data stored in files to/from the [[InternalRow]] format.
 */
trait FileFormatV2 {
Since the API already supports v1 Filter, I don't think we need to make these changes. We should just continue to support the v1 filters for older sources. That decouples these changes from updates to the file sources.
|
@dbtsai, this looks like a great start to me. I'd really like to see a v2 API for predicates/filters. One thing that's missing: the v2 API is written as Java interfaces, and while Spark has its own implementations that are case classes, we do need Java interfaces defined for the new filter expressions. I'd also recommend creating extractor functions like the ones we created to work with transforms. Those allow us to seamlessly use the Spark internal class names even if the source has returned a different implementation of the Java interface.
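A rough sketch of the extractor-function idea; EqualToFilter (with ref() and value() accessors) and pushDownEquality are hypothetical names used only for illustration:

```scala
// Extractor object that matches on the Java interface, so planning code can pattern
// match with Spark's own names regardless of which implementation the source returned.
import org.apache.spark.sql.connector.expressions.NamedReference
import org.apache.spark.sql.sources.v2.FilterV2 // package as shown in this PR's diff

object EqualTo {
  def unapply(filter: FilterV2): Option[(NamedReference, Any)] = filter match {
    case e: EqualToFilter => Some((e.ref, e.value)) // EqualToFilter: hypothetical Java interface
    case _ => None
  }
}

// Usage in planning code (handler name is illustrative):
// filters.collect { case EqualTo(ref, value) => pushDownEquality(ref, value) }
```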
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.